54 research outputs found

    A hybrid algorithm for Bayesian network structure learning with application to multi-label learning

    We present a novel hybrid algorithm for Bayesian network structure learning, called H2PC. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. We conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for Bayesian network structure learning. First, we use eight well-known Bayesian network benchmarks with various data sizes to assess the quality of the learned structure returned by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in terms of goodness of fit to new data and quality of the network structure with respect to the true dependence structure of the data. Second, we investigate H2PC's ability to solve the multi-label learning problem. We provide theoretical results to characterize and identify graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution under the faithfulness condition. The multi-label learning problem is then decomposed into a series of multi-class classification problems, where each multi-class variable encodes a label powerset. H2PC is shown to compare favorably to MMHC in terms of global classification accuracy over ten multi-label data sets covering different application domains. Overall, our experiments support the conclusion that local structure learning with H2PC, in the form of local neighborhood induction, is a theoretically well-motivated and empirically effective learning framework that is well suited to multi-label learning. The source code (in R) of H2PC as well as all data sets used for the empirical tests are publicly available.

    Comment: arXiv admin note: text overlap with arXiv:1101.5184 by other authors
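
The two-phase recipe described above (constraint-based skeleton discovery, then score-based greedy search) can be sketched in a few lines. This is a deliberately simplified illustration, not the authors' H2PC: it uses unconditional pairwise mutual-information tests for the skeleton (H2PC uses divide-and-conquer conditional-independence subroutines around each target) and a BIC-scored greedy edge-insertion pass in place of a full hill-climbing search. The function names and the 0.01 threshold are our own choices.

```python
import numpy as np
from itertools import combinations

def mutual_info(x, y):
    # Empirical mutual information (in nats) between two discrete variables.
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

def learn_skeleton(data, threshold=0.01):
    # Phase 1 (constraint-based): keep an undirected edge i--j when the two
    # variables look empirically dependent. H2PC proper runs conditional
    # independence tests around each target; this pairwise test is a stand-in.
    d = data.shape[1]
    return {(i, j) for i, j in combinations(range(d), 2)
            if mutual_info(data[:, i], data[:, j]) > threshold}

def bic(data, child, parents):
    # Decomposable BIC score of one node given its parents (discrete data).
    n = len(data)
    child_vals = np.unique(data[:, child])
    keys = [tuple(row) for row in data[:, parents]] if parents else [()] * n
    ll, n_params = 0.0, 0
    for key in set(keys):
        mask = np.array([k == key for k in keys])
        sub = data[mask, child]
        n_params += len(child_vals) - 1
        for v in child_vals:
            c = np.sum(sub == v)
            if c > 0:
                ll += c * np.log(c / len(sub))
    return ll - 0.5 * n_params * np.log(n)

def creates_cycle(parents, u, v):
    # Would the directed edge u -> v close a cycle? True iff v is already
    # an ancestor of u (follow parent pointers upward from u).
    stack, seen = [u], set()
    while stack:
        node = stack.pop()
        if node == v:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def hill_climb(data, skeleton):
    # Phase 2 (score-based): greedily add directed edges drawn from the
    # skeleton whenever they improve the BIC score and create no cycle.
    d = data.shape[1]
    parents = {i: [] for i in range(d)}
    candidates = {(i, j) for i, j in skeleton} | {(j, i) for i, j in skeleton}
    improved = True
    while improved:
        improved = False
        best_gain, best_edge = 1e-9, None
        for u, v in candidates:
            if u in parents[v] or creates_cycle(parents, u, v):
                continue
            gain = bic(data, v, parents[v] + [u]) - bic(data, v, parents[v])
            if gain > best_gain:
                best_gain, best_edge = gain, (u, v)
        if best_edge:
            u, v = best_edge
            parents[v].append(u)
            improved = True
    return parents
```

On data generated from a dependent pair with an independent third variable, the sketch should recover the single skeleton edge and orient it in one of the two score-equivalent directions.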

    Different aspects for clustering the self-organizing maps

    The Self-Organizing Map (SOM) is an artificial neural network tool trained with unsupervised learning to produce a low-dimensional representation of the input space, called a map. This map is generally the object of a subsequent clustering step, which aims to partition the referent vectors (map neurons) into compact and well-separated groups. In this paper, we consider the problem of clustering the self-organizing map from several angles: partitioning, hierarchical, and graph-coloring-based techniques. Unlike traditional SOM clustering techniques, which use k-means or hierarchical clustering, the graph-based approaches have the advantage of partitioning the self-organizing map by simultaneously using the dissimilarities and the neighborhood relations provided by the map. We present experimental results from several comparisons between these different ways of clustering the map.
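
The two-level strategy the abstract describes, first training a SOM and then clustering its prototypes rather than the raw data, can be sketched as follows. This is a minimal illustration under our own assumptions (a small grid, a shrinking Gaussian neighborhood, and plain k-means as the second level); the paper's graph-coloring variant, which also exploits the map's neighborhood relations, is not reproduced here.

```python
import numpy as np

def train_som(data, rows=4, cols=4, epochs=20, seed=0):
    # Minimal online SOM: prototypes tied to a grid, updated toward each
    # sample with a Gaussian neighborhood that shrinks over the epochs.
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    protos = rng.normal(size=(rows * cols, data.shape[1]))
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)                  # decaying step size
        sigma = max(0.5, (rows / 2) * (1 - epoch / epochs))
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(((protos - x) ** 2).sum(1))  # best-matching unit
            h = np.exp(-((grid - grid[bmu]) ** 2).sum(1) / (2 * sigma ** 2))
            protos += lr * h[:, None] * (x - protos)
    return protos, grid

def cluster_map(protos, k=2, iters=50, seed=0):
    # Second level: partition the map's referent vectors, not the raw data.
    # Here plain k-means on the prototypes; the paper also studies
    # hierarchical and graph-coloring alternatives.
    rng = np.random.default_rng(seed)
    centers = protos[rng.choice(len(protos), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((protos[:, None] - centers) ** 2).sum(2), axis=1)
        centers = np.array([protos[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels
```

Each data point then inherits the cluster label of its best-matching prototype, so the whole data set is partitioned at the cost of clustering only the (much smaller) map.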

    Clustering and prediction of heterogeneous data (application to patient trajectories and hospital stays)

    Recent years have seen the development of data mining techniques in many application areas, with the purpose of analyzing large and complex data. Healthcare is one such area: the available data are numerous and of varied natures, including classical attributes (such as patient age and sex) and symbolic attributes (such as medical procedures and diagnoses). Broadly, data mining comprises descriptive techniques, which aim to reveal information that is present but hidden by the sheer volume of the data, and predictive techniques, which seek to extrapolate new knowledge from the information present in the data. This thesis addresses the problem of clustering and prediction of heterogeneous data through two main proposals. The first is a new clustering approach based on a graph-theoretic technique called b-coloring, together with an incremental-learning extension that lets new data be integrated automatically into the initially generated partition without re-running the global clustering. The second concerns sequential data analysis: the previous clustering approach is combined with mixture Markov-chain models to partition temporal sequences into homogeneous and meaningful groups. The resulting model yields easily interpretable clusters and can estimate the evolution of the sequences of a given cluster. Both proposals were then applied to data from the French hospital information system (PMSI), with a view to supporting the strategic management of healthcare establishments. The first application proposes a finer typology of hospital stays, as an alternative to the existing Diagnosis Related Groups (DRG) classification. The second defines a typology of patient trajectories (the succession of hospital stays of a single patient) in order to statistically predict the characteristics of the next stay of a patient arriving at a healthcare establishment. The overall framework thus provides a decision-support environment for monitoring and managing the organization of the care system.
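
The sequence-clustering stage described above can be illustrated with a small EM procedure for a mixture of first-order Markov chains. This is a generic sketch under our own assumptions (discrete states, random soft initialization, a small smoothing constant), not the thesis's hybrid b-coloring/Markov model; the function name and defaults are illustrative.

```python
import numpy as np

def em_markov_mixture(seqs, k=2, n_states=2, iters=50, seed=0):
    # EM for a mixture of first-order Markov chains over discrete states.
    # Each sequence gets a soft assignment (responsibility) to each of the
    # k components; each component has its own transition matrix.
    rng = np.random.default_rng(seed)
    resp = rng.dirichlet(np.ones(k), size=len(seqs))       # soft assignments
    trans = rng.dirichlet(np.ones(n_states), size=(k, n_states))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # M-step: expected transition counts per component (lightly smoothed).
        counts = np.full((k, n_states, n_states), 1e-3)
        for s_idx, s in enumerate(seqs):
            for a, b in zip(s[:-1], s[1:]):
                counts[:, a, b] += resp[s_idx]
        trans = counts / counts.sum(axis=2, keepdims=True)
        weights = resp.mean(axis=0)
        # E-step: responsibilities from per-sequence log-likelihoods.
        loglik = np.zeros((len(seqs), k))
        for s_idx, s in enumerate(seqs):
            for a, b in zip(s[:-1], s[1:]):
                loglik[s_idx] += np.log(trans[:, a, b])
        loglik += np.log(weights)
        loglik -= loglik.max(axis=1, keepdims=True)        # for stability
        resp = np.exp(loglik)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1), trans
```

Fitting this to sequences drawn from two contrasting regimes should recover the two groups, and the learned `trans` matrices estimate each cluster's dynamics, i.e. the "evolution" of the sequences in a given cluster.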

    Unsupervised Feature Selection with Ensemble Learning

    In this paper, we show that the way internal estimates are used to measure variable importance in Random Forests is also applicable to feature selection in unsupervised learning. We propose a new method, called Random Cluster Ensemble (RCE for short), that estimates the out-of-bag feature importance from an ensemble of partitions. Each partition is constructed using a different bootstrap sample and a random subset of the features. We provide empirical results on nineteen benchmark data sets indicating that RCE, boosted with a recursive feature elimination scheme (RFE), can lead to significant improvements in clustering accuracy over several state-of-the-art supervised and unsupervised algorithms, using a very limited subset of features. The method shows promise for very large domains. All results, data sets, and algorithms are available online.
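
One way to read the RCE idea, transposing Random-Forest-style out-of-bag permutation importance to an ensemble of clusterings, is sketched below. This is our own simplified interpretation, not the authors' algorithm or code: each ensemble member clusters a bootstrap sample on a random feature subset (here with plain k-means), and a feature's importance is how often out-of-bag points change cluster when that feature's column is permuted. All names and defaults are illustrative.

```python
import numpy as np

def kmeans(X, k, iters=30, rng=None):
    # Plain k-means used as the base clusterer of each ensemble member.
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(2), axis=1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers

def rce_importance(X, k=2, n_estimators=40, seed=0):
    # Each member sees a bootstrap sample and a random feature subset;
    # importance of a feature = mean fraction of out-of-bag points whose
    # cluster assignment changes when that feature is shuffled (the
    # analogue of Random-Forest permutation importance).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = max(2, int(np.sqrt(d)) + 1)          # features per member
    imp, hits = np.zeros(d), np.zeros(d)
    for _ in range(n_estimators):
        feats = rng.choice(d, m, replace=False)
        boot = rng.choice(n, n, replace=True)
        oob = np.setdiff1d(np.arange(n), boot)
        centers = kmeans(X[np.ix_(boot, feats)], k, rng=rng)
        Xo = X[np.ix_(oob, feats)]
        base = np.argmin(((Xo[:, None] - centers) ** 2).sum(2), axis=1)
        for f_pos, f in enumerate(feats):
            Xp = Xo.copy()
            Xp[:, f_pos] = rng.permutation(Xp[:, f_pos])
            perm = np.argmin(((Xp[:, None] - centers) ** 2).sum(2), axis=1)
            imp[f] += np.mean(perm != base)
            hits[f] += 1
    return imp / np.maximum(hits, 1)
```

Ranking features by this score and repeatedly dropping the weakest ones would give the recursive-feature-elimination (RFE) loop the abstract mentions.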

    An Empirical Comparison of Supervised Ensemble Learning Approaches

    We present an extensive empirical comparison of twenty prototypical supervised ensemble learning algorithms, including Boosting, Bagging, Random Forests, Rotation Forests, Arc-X4, Class-Switching and their variants, as well as more recent techniques like Random Patches. These algorithms were compared against each other in terms of threshold, ranking/ordering, and probability metrics over nineteen UCI benchmark data sets with binary labels. We also examine the influence of two base learners, CART and Extremely Randomized Trees, and the effect of calibrating the models via isotonic regression, on each performance metric. The selected data sets have already been used in various empirical studies and cover different application domains. The experimental analysis was restricted to the hundred most relevant features according to the SNR filter method, with a view to dramatically reducing the computational burden of the simulation. The source code and the detailed results of our study are publicly available.
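
The calibration step mentioned above, isotonic regression, can be illustrated with a small Pool-Adjacent-Violators (PAV) implementation that maps raw classifier scores to calibrated probabilities. This is a generic reimplementation for illustration, not the study's code; the function name is ours.

```python
import numpy as np

def isotonic_calibrate(scores, labels):
    # Pool-Adjacent-Violators: fit the best monotone (non-decreasing) map
    # from raw scores to label frequencies, the standard isotonic
    # regression fit used for probability calibration.
    order = np.argsort(scores)
    y = np.asarray(labels, float)[order]
    # Each block holds [sum of labels, count]; adjacent blocks whose means
    # violate monotonicity are merged until the means are non-decreasing.
    blocks = [[v, 1.0] for v in y]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] / blocks[i][1] > blocks[i + 1][0] / blocks[i + 1][1]:
            blocks[i][0] += blocks[i + 1][0]
            blocks[i][1] += blocks[i + 1][1]
            del blocks[i + 1]
            i = max(i - 1, 0)        # a merge can create a new violation
        else:
            i += 1
    fitted = np.concatenate([[b[0] / b[1]] * int(b[1]) for b in blocks])
    out = np.empty_like(fitted)
    out[order] = fitted              # undo the sort
    return out
```

The fitted values form a non-decreasing step function of the scores; in practice the map is learned on held-out data and then applied to test-set scores before computing the probability metrics.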